Introduction

For this project, we used multiple online databases, described below.

  1. tRFdb (http://genome.bioch.virginia.edu/trfdb/index.php)
  1. gtRNAdb (http://gtrnadb.ucsc.edu)
  1. Modomics (http://genesilico.pl/modomics/)
  1. MINTbase (https://cm.jefferson.edu/MINTbase/)

Rationale

Liz, Kavyaa, and I webscraped for information publicly available in multiple databases to determine the size and abundance of tRNA fragments. We are using that raw data to learn more about them in terms of their associated proteins and aim to determine the function of these conserved fragments. Datamining has proven to be a helpful way for us to study tRNA fragments outside of the typical wet lab setting. Our ultimate goal is to use these data to inform our future experiments using TGIRT-seq. Using TGIRT-seq will also give us the greater processivity of TGIRT reverse transcriptases across the difficult structures and modifications of tRNAs and their fragments.

Results/Methods

We primarily used three R scripts. In the first, we loaded in the data from the tRFdb, formatted it to contain the information we were looking for, and wrote it into a csv file so it could be accessed even when the website is down. The second is a loop that allowed us to do the actual webscraping from the tRFdb website to determine the nucleotide counts found in their deep sequencing data that mapped to their tRNA construct for each gene. Finally, we used the information we webscraped to create plots for the visualization of the fragment sizes and abundance.

Prepwork Script

#Here, we read in the data from the tRFdb and formatted it to add columns that contained the start and end of the map positions.
tRFdb_human_tRF3_list <- read_csv("data/trfdb_human_tRF3.csv")
View(tRFdb_human_tRF3_list)

tRFdb_human_tRF3_list_mod <- tRFdb_human_tRF3_list %>% 
  mutate(tRF_start = str_match(`tRF Map Positions`, "(?<=Start:)\\d+"),
         tRF_end = str_match(`tRF Map Positions`, "(?<=End:)\\d+"))
#Then, we wrote it into a csv file and gave it a relative file path. 
write_csv(tRFdb_human_tRF3_list_mod, "data/trfdb_human_tRF3_update.csv")
#Here, we use a reference URL to make a list of all the experimental data
url <- "http://genome.bioch.virginia.edu/trfdb/experiments_display.php?trf_id=3023a&organism=human"
#We read in the html file as a table, and format it as a dataframe, from which we extract the first element of the list. 
experiments <- read_html(url) %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE) %>% 
  .[[1]] 
#Then, we filter for the columns of the dataframe that we are specifically looking at and write it into a csv file, so it can be accessed even when the website is down. 
experiments <- experiments %>% 
  select(Experiment,Source)
View(experiments)
write_csv(experiments, "data/tRFdb_experiment_list.csv")
#Finally, we cut down this whole dataset to look specifically at the data from the experiments we want, which are the HEK and PARCLIP experiments, and write the trimmed list into its own csv file. 
HEK_experiments <- experiments %>% 
  filter(str_detect(Source, "HEK") | str_detect(Source, "PARCLIP"))
View(HEK_experiments)
write_csv(HEK_experiments, "data/HEK_experiment_list.csv")

Loop Script We use this to pull out the gene names and coordinates to build links to webscrape nucleotide abundance data for the tRNA gene we are looking at.

#Here, we are assigning variables to be used later in the script. We place this at the top of the script to make it easier to change the tRNA gene (with or without a specific anticodon) we are looking at. 
tRNA <- "Phe"
formatted_tRF_filename <- "data/tRF-PheGAA-tRFdb-HEK-abundance.csv"
#We are assigning variables that we will be using later to build the links.
HEK_experiment_IDs <- HEK_experiments %>% 
  pull(Experiment)

HEK_experiment_description <- HEK_experiments %>% 
  pull(Source)
#Use this code to peek at the dataset to see how the tRNA names were formatted, so that we can filter for them later on.
tRFdb_human_tRF3_list[1,"tRNA Name"] 
View(tRFdb_human_tRF3_list) 
#Filter the dataset for the tRNA you are specifically looking at and assign its gene coordinates to a variable for later use. 
tRF_gene_coordinates <- tRFdb_human_tRF3_list %>% 
  filter(str_detect(`tRNA Name`, tRNA)) %>% 
  pull(`tRNA Gene Co-ordinates`) %>% 
  unique()
#Filter again, but this time, create a variable that includes the tRNA gene name.
tRF_gene_names <- tRFdb_human_tRF3_list %>% 
  filter(str_detect(`tRNA Name`, tRNA)) %>% 
  pull(`tRNA Name`) %>% 
  unique()
#Here, we assign all the componants of the tRFdb links to smaller variables so that we can use them to build an entire link specific for the genes we are looking at. 
link1 = "http://genome.bioch.virginia.edu/trfdb/"
link2 = "graph.php?exp="
link3 = HEK_experiment_IDs
link4 = "&org=human&part1="
link5 = tRF_gene_coordinates
link6 = "&part2="
link7 = tRF_gene_names
#We use nested for loops to build the links to the gene we want and to read the file and convert it into a table that is readable in R, which we then input into a large dataframe with all of the results of the loops. 
all_HEK_TRF_nuc_freq <- tibble()
x <- tibble()
for (experiment in 1:length(HEK_experiment_IDs)) { # for each of the experiments we're interested
  for (gene in 1:length(tRF_gene_names)) { # for each of the tRNA genes we're interested
    gene_df = paste(link1,link2,link3[experiment],link4,link5[gene],link6,link7[gene],sep = "") %>% # build a url address
    read_html() %>% # read that url into R
    html_table() %>% # convert the R file into a table
    .[[1]] %>% # we specify that want just the first element of the list created by html_table
    as_tibble() %>% 
    mutate("tRNA Gene"=link7[gene],
    "Nucleotide Position"=COUNT,
    "Experiment" = HEK_experiment_description[experiment],
    "Experiment#" = HEK_experiment_IDs[experiment]) # here I'm formatting the dataframe with extra columns
    all_HEK_TRF_nuc_freq = rbind(all_HEK_TRF_nuc_freq, gene_df) # this binds all the dataframes created by the for loops together in one big dataframe
  }}
#Write this new large dataframe into a csv file
#write_csv(all_HEK_TRF_nuc_freq, formatted_tRF_filename)

formatted_tRF_filename <- "data/tRF-His-tRFdb-HEK-abundance.csv"
all_HEK_tRF_freq <- read_csv(formatted_tRF_filename)
all_HEK_tRF_freq
## # A tibble: 6,300 x 6
##    COUNT   RPM `tRNA Gene`       `Nucleotide Position` Experiment `Experiment#`
##    <dbl> <dbl> <chr>                             <dbl> <chr>      <chr>        
##  1     1  110. chr9.trna7-HisGTG                     1 HEK293     GSM416733    
##  2     2  161. chr9.trna7-HisGTG                     2 HEK293     GSM416733    
##  3     3  171. chr9.trna7-HisGTG                     3 HEK293     GSM416733    
##  4     4  183. chr9.trna7-HisGTG                     4 HEK293     GSM416733    
##  5     5  185. chr9.trna7-HisGTG                     5 HEK293     GSM416733    
##  6     6  187. chr9.trna7-HisGTG                     6 HEK293     GSM416733    
##  7     7  187. chr9.trna7-HisGTG                     7 HEK293     GSM416733    
##  8     8  217. chr9.trna7-HisGTG                     8 HEK293     GSM416733    
##  9     9  225. chr9.trna7-HisGTG                     9 HEK293     GSM416733    
## 10    10  231. chr9.trna7-HisGTG                    10 HEK293     GSM416733    
## # … with 6,290 more rows
trfdb_human_tRF3_update2 <- read_csv("data/trfdb_human_tRF3_update2.csv")
all_HEK_tRF_freq_update2 <- left_join(all_HEK_tRF_freq,trfdb_human_tRF3_update2, by = c("tRNA Gene"="tRNA Name")) %>% 
  select(`Nucleotide Position`, Experiment, RPM, gtRNA_ID_abbrev)
all_HEK_tRF_freq_update2
## # A tibble: 12,600 x 4
##    `Nucleotide Position` Experiment   RPM gtRNA_ID_abbrev
##                    <dbl> <chr>      <dbl> <chr>          
##  1                     1 HEK293      110. TRH-GTG-1-6    
##  2                     1 HEK293      110. TRH-GTG-1-6    
##  3                     2 HEK293      161. TRH-GTG-1-6    
##  4                     2 HEK293      161. TRH-GTG-1-6    
##  5                     3 HEK293      171. TRH-GTG-1-6    
##  6                     3 HEK293      171. TRH-GTG-1-6    
##  7                     4 HEK293      183. TRH-GTG-1-6    
##  8                     4 HEK293      183. TRH-GTG-1-6    
##  9                     5 HEK293      185. TRH-GTG-1-6    
## 10                     5 HEK293      185. TRH-GTG-1-6    
## # … with 12,590 more rows

Graph Script - Overlay This script gives us a good idea of the fragments generated from each experiment, including their sizes and abundance, relative to one another.

#Here, we provide the setup information, where we can quickly and easily replace the tRNA name with the one we are looking for. 
formatted_tRF_filename <- "data/tRF-His-tRFdb-HEK-abundance.csv"
graph_name_html <- file.path(getwd(), "graphs-plotly", "tRF-His-tRFdb-HEK-abundance-overlay.html")
all_HEK_tRF_freq <- read_csv(formatted_tRF_filename)
#The, we left join our two datasets to add the correct gtRNA ID's to the plot
all_HEK_tRF_freq_update2<- left_join(all_HEK_tRF_freq,trfdb_human_tRF3_update2, by = c("tRNA Gene"="tRNA Name")) %>% 
  select(`Nucleotide Position`, Experiment, RPM, gtRNA_ID_abbrev)

graph_title <- "tRNA-His fragment coverage"
#Establish colors you want to use in the graph 
color_blind_friendly_cols<-c("#d7191c","#fdae61","#abd9e9","#2c7bb6","#000000")
# Use the variables we had previously created to graph the nucleotide counts (in reads per million reads), colored by each experiment the data came from. 
graph <- all_HEK_tRF_freq_update2 %>%
  ggplot(aes(x=`Nucleotide Position`, y=RPM, color= Experiment, label = `gtRNA_ID_abbrev`))+
  geom_path()+labs(title = graph_title)
graph

#Finally, load into ggplot to make into an interactive graph (so you can mouse over a fragment and look at the exact counts), and view it!
p <- ggplotly(graph)
p

Graph Script - Faceted This script just includes a different way to view the nucleotide counts so that the graph does not look so cluttered. Instead, the graph is recreated for each gene and oriented next to the others.

#Graph the tRF frequencies!
p = all_HEK_tRF_freq %>%
  ggplot(aes(x=`Nucleotide Position`, y=RPM, color= Experiment))+
  geom_line()+
  scale_color_manual(values = color_blind_friendly_cols)+
  facet_grid(`tRNA Gene` ~ .)+
  labs(title = "tRNA-His fragment coverage")
#This line of code graphs only experiments that are *not* from the HEK293 experiments. 
p2 = all_HEK_tRF_freq %>%
  filter(Experiment != "HEK293") %>% 
  ggplot(aes(x=`Nucleotide Position`, y=RPM, color= Experiment))+
  geom_line()+
  scale_color_manual(values = color_blind_friendly_cols)+
  facet_grid(`tRNA Gene` ~ .)+
  labs(title = "tRNA-His fragment coverage")
#Here, we can further format the graph to hide the panel borders and remove the grid lines.
final_plot = p + theme(
  panel.background = element_blank(),
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  axis.line = element_line(colour = "black")
)
#View the graph!
final_plot

In the Histidine plot, we can see very well-defined and uniform fragment sizes, many around 22 nucleotides long, for all the datasets analyzed. Furthermore, we can see a very low association with AGO2 proteins and high association with AGO proteins 3 and 4.

Additional Plots

Lysine CTT

Here, we see a much greater variation in fragment size. Additionally, by comparing the scale of this graph to the previous one, we can see that many more fragments were obtained by the AGO-pull down assays with LysCTT than His. We see that the HEK experiments obtained more fragments, which is expected since that dataset contains the entire small RNA population. Different fragment nucleotide locations are associated in greater abundance with some of the AGO proteins, with AGO3 interacting with 5’tRNA fragments and internal tRNA fragments. Interestingly, the 3’tRNA fragments around position 59-76 have some similarities with AGO-association as tRF-His, with AGO4 proteins having the most associated reads, and few reads associated with AGO2. A difference in this case, though, is that the next highest abundance reads are with AGO1, followed by AGO3 (use the plotly graph, zoom and pan interactive functions to zoom into this region).

## # A tibble: 5,715 x 6
##    COUNT   RPM `tRNA Gene`        `Nucleotide Position` Experiment `Experiment#`
##    <dbl> <dbl> <chr>                              <dbl> <chr>      <chr>        
##  1     1 1874. chr15.trna2-LysCTT                     1 HEK293     GSM416733    
##  2     2 1886. chr15.trna2-LysCTT                     2 HEK293     GSM416733    
##  3     3 2008. chr15.trna2-LysCTT                     3 HEK293     GSM416733    
##  4     4 2082. chr15.trna2-LysCTT                     4 HEK293     GSM416733    
##  5     5 2108. chr15.trna2-LysCTT                     5 HEK293     GSM416733    
##  6     6 2152. chr15.trna2-LysCTT                     6 HEK293     GSM416733    
##  7     7 2170. chr15.trna2-LysCTT                     7 HEK293     GSM416733    
##  8     8 2191. chr15.trna2-LysCTT                     8 HEK293     GSM416733    
##  9     9 2204. chr15.trna2-LysCTT                     9 HEK293     GSM416733    
## 10    10 2219. chr15.trna2-LysCTT                    10 HEK293     GSM416733    
## # … with 5,705 more rows
## # A tibble: 11,430 x 4
##    `Nucleotide Position` Experiment   RPM gtRNA_ID_abbrev
##                    <dbl> <chr>      <dbl> <chr>          
##  1                     1 HEK293     1874. TRK-CTT-1-2    
##  2                     1 HEK293     1874. TRK-CTT-1-2    
##  3                     2 HEK293     1886. TRK-CTT-1-2    
##  4                     2 HEK293     1886. TRK-CTT-1-2    
##  5                     3 HEK293     2008. TRK-CTT-1-2    
##  6                     3 HEK293     2008. TRK-CTT-1-2    
##  7                     4 HEK293     2082. TRK-CTT-1-2    
##  8                     4 HEK293     2082. TRK-CTT-1-2    
##  9                     5 HEK293     2108. TRK-CTT-1-2    
## 10                     5 HEK293     2108. TRK-CTT-1-2    
## # … with 11,420 more rows

Tyrosine

With Tyr, we see many 22 and 25 nucleotide tRNA fragments that associate with AGO4 and AGO3 proteins in high abundance, higher than those acquired from the HEK experiments. Additionally, we see a higher association with AGO2 with Tyr than the other two tRNA fragment types shown previously.

## # A tibble: 6,530 x 6
##    COUNT   RPM `tRNA Gene`       `Nucleotide Position` Experiment `Experiment#`
##    <dbl> <dbl> <chr>                             <dbl> <chr>      <chr>        
##  1     1  419. chr8.trna5-TyrGTA                     1 HEK293     GSM416733    
##  2     2  421. chr8.trna5-TyrGTA                     2 HEK293     GSM416733    
##  3     3  422. chr8.trna5-TyrGTA                     3 HEK293     GSM416733    
##  4     4  422. chr8.trna5-TyrGTA                     4 HEK293     GSM416733    
##  5     5  423. chr8.trna5-TyrGTA                     5 HEK293     GSM416733    
##  6     6  424. chr8.trna5-TyrGTA                     6 HEK293     GSM416733    
##  7     7  424. chr8.trna5-TyrGTA                     7 HEK293     GSM416733    
##  8     8  426. chr8.trna5-TyrGTA                     8 HEK293     GSM416733    
##  9     9  426. chr8.trna5-TyrGTA                     9 HEK293     GSM416733    
## 10    10  426. chr8.trna5-TyrGTA                    10 HEK293     GSM416733    
## # … with 6,520 more rows
## # A tibble: 12,320 x 4
##    `Nucleotide Position` Experiment   RPM gtRNA_ID_abbrev
##                    <dbl> <chr>      <dbl> <chr>          
##  1                     1 HEK293      419. TRY-GTA-5-2    
##  2                     1 HEK293      419. TRY-GTA-5-2    
##  3                     2 HEK293      421. TRY-GTA-5-2    
##  4                     2 HEK293      421. TRY-GTA-5-2    
##  5                     3 HEK293      422. TRY-GTA-5-2    
##  6                     3 HEK293      422. TRY-GTA-5-2    
##  7                     4 HEK293      422. TRY-GTA-5-2    
##  8                     4 HEK293      422. TRY-GTA-5-2    
##  9                     5 HEK293      423. TRY-GTA-5-2    
## 10                     5 HEK293      423. TRY-GTA-5-2    
## # … with 12,310 more rows

Future Directions

Our next step is to analyze the tRFdb data using our own alignment pipeline. From here, we will be able to be more forgiving with our alignment to include mutations added by the reverse transcriptase. We can compare our results to those of tRFdb to determine whether a significant amount of tRNA fragments were thrown out due to the strict pipeline they used. This is important because reverse transcriptases are not always able to read through modification sites and assign the correct nucleotide, so the tRFdb site may be throwing out reads, since they will not match the alignment construct with 100% accuracy. From there, we can look at the MODOMICS website to determine whether our pipeline was better able to retain fragments that overlapped with a known m1A modification found at the 3’end of tRNAs.

References

Hafner, Markus, Markus Landthaler, Lukas Burger, Mohsen Khorshid, Jean Hausser, Philipp Berninger, Andrea Rothballer, et al. “Transcriptome-Wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP.” Cell 141, no. 1 (April 2, 2010): 129–41. https://doi.org/10.1016/j.cell.2010.03.009.

Kumar, Pankaj, Jordan Anaya, Suresh B. Mudunuri, and Anindya Dutta. “Meta-Analysis of TRNA Derived RNA Fragments Reveals That They Are Evolutionarily Conserved and Associate with AGO Proteins to Recognize Specific RNA Targets.” BMC Biology 12, no. 1 (October 1, 2014): 78. https://doi.org/10.1186/s12915-014-0078-0.

Kumar, Pankaj, Suresh B. Mudunuri, Jordan Anaya, and Anindya Dutta. “TRFdb: A Database for Transfer RNA Fragments.” Nucleic Acids Research 43, no. Database issue (January 28, 2015): D141–45. https://doi.org/10.1093/nar/gku1138.

Mayr, Christine, and David P. Bartel. “Widespread Shortening of 3′UTRs by Alternative Cleavage and Polyadenylation Activates Oncogenes in Cancer Cells.” Cell 138, no. 4 (August 21, 2009): 673–84. https://doi.org/10.1016/j.cell.2009.06.016.